    Back-translation for discovering distant protein homologies

    Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, homology detection becomes difficult even at the DNA level. To cope with this situation, we propose a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences that have the best-scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. This allows us to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples. Comment: The 9th International Workshop on Algorithms in Bioinformatics (WABI), Philadelphia, United States (2009)
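
    The "complete set of putative DNA sequences" of a protein can be illustrated with a naive back-translation under the standard genetic code. This is only a sketch: the names `CODONS_FOR`, `count_back_translations` and `back_translations` are hypothetical, and the brute-force enumeration below is not the memory-efficient graph representation or the alignment algorithm described in the abstract. It mainly shows why an explicit enumeration grows too quickly to be practical.

```python
from itertools import product
from math import prod

# Standard genetic code in the conventional TCAG ordering ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (or '*') to the list of codons that encode it.
CODONS_FOR = {}
for (i, a), (j, b), (k, c) in product(enumerate(BASES), repeat=3):
    CODONS_FOR.setdefault(AMINO_ACIDS[16 * i + 4 * j + k], []).append(a + b + c)


def count_back_translations(protein):
    """Number of distinct DNA sequences that translate to `protein`."""
    return prod(len(CODONS_FOR[aa]) for aa in protein)


def back_translations(protein):
    """Yield every putative coding sequence (only feasible for short peptides)."""
    for codons in product(*(CODONS_FOR[aa] for aa in protein)):
        yield "".join(codons)


if __name__ == "__main__":
    peptide = "MLS"  # hypothetical example peptide
    print(count_back_translations(peptide), "putative coding sequences")
    for dna in back_translations("MW"):
        print(dna)
```

    Because the count is a product over residues, it grows exponentially with protein length, which is the motivation for a compact graph encoding rather than an explicit list.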

    webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser

    Background: Phylogeny-aware progressive alignment has been found to perform well in phylogenetic alignment benchmarks and to produce superior alignments for the inference of selection on codon sequences. Its implementation in the PRANK alignment program package also allows modelling of complex evolutionary processes and inference of posterior probabilities for sequence sites evolving under each distinct scenario, either simultaneously with the alignment of sequences or as a post-processing step for an existing alignment. This has led to software with many advanced features, and users may find it difficult to generate optimal alignments, visualise the full information in their alignment results, or post-process these results, e.g. by objectively selecting subsets of alignment sites. Results: We have created a web server called webPRANK that provides an easy-to-use interface to the PRANK phylogeny-aware alignment algorithm. The webPRANK server supports the alignment of DNA, protein and codon sequences as well as protein-translated alignment of cDNAs, and includes built-in structure models for the alignment of genomic sequences. The resulting alignments can be exported in various formats widely used in evolutionary sequence analyses. The webPRANK server also includes a powerful web-based alignment browser for the visualisation and post-processing of the results in the context of a cladogram relating the sequences, allowing (e.g.) removal of alignment columns with low posterior reliability. In addition to de novo alignments, webPRANK can be used for the inference of ancestral sequences with phylogenetically realistic gap patterns, and for the annotation and post-processing of existing alignments. The webPRANK server is freely available on the web at http://tinyurl.com/webprank. Conclusions: The webPRANK server incorporates phylogeny-aware multiple sequence alignment, visualisation and post-processing in an easy-to-use web interface. It widens the user base of phylogeny-aware multiple sequence alignment and allows the performance of all alignment-related activity for small sequence analysis projects using only a standard web browser.

    Selective Constraints on Amino Acids Estimated by a Mechanistic Codon Substitution Model with Multiple Nucleotide Changes

    Empirical substitution matrices represent the average tendencies of substitutions over various protein families by sacrificing gene-level resolution. We develop a codon-based model in which the mutational tendencies of codons, the genetic code, and the strength of selective constraints against amino acid replacements can be tailored to a given gene. First, selective constraints averaged over proteins are estimated by maximizing the likelihood of each 1-PAM matrix of empirical amino acid (JTT, WAG, and LG) and codon (KHG) substitution matrices. Then, selective constraints specific to given proteins are approximated as a linear function of those estimated from the empirical substitution matrices. Akaike information criterion (AIC) values indicate that a model allowing multiple nucleotide changes fits the empirical substitution matrices significantly better. Also, the ML estimates of transition-transversion bias obtained from these empirical matrices are not as large as previously estimated. The selective constraints are characteristic of proteins rather than species. However, their relative strengths among amino acid pairs depend, to a good approximation, much less on the protein family than on the amino acid pair, because the present model, in which selective constraints are approximated as a linear function of those estimated from the JTT/WAG/LG/KHG matrices, can provide a good fit to other empirical substitution matrices including cpREV for chloroplast proteins and mtREV for vertebrate mitochondrial proteins.
The present codon-based model with the ML estimates of selective constraints and with adjustable nucleotide mutation rates would be useful as a simple substitution model in ML and Bayesian inferences of molecular phylogenetic trees, and enables us to obtain biologically meaningful information at both nucleotide and amino acid levels from codon and protein sequences. Comment: Table 9 in this article includes corrections for errata in the Table 9 published in 10.1371/journal.pone.0017244. Supporting information is attached at the end of the article, and a computer-readable dataset of the ML estimates of selective constraints is available from 10.1371/journal.pone.001724

    Correcting the Bias of Empirical Frequency Parameter Estimators in Codon Models

    Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators are biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a “corrected” empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the -style estimators
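
    One way to see how stop codons can bias frequency estimates is a small numerical sketch. It assumes a simple estimator in which each sense-codon frequency is the product of position-independent nucleotide frequencies, renormalized after excluding the three standard stop codons. This estimator and the function names are illustrative assumptions, not necessarily the exact class of estimators analysed in the abstract, and the sketch does not implement the proposed correction.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def product_codon_frequencies(nuc_freqs):
    """Sense-codon frequencies as products of nucleotide frequencies,
    renormalized after dropping stop codons (illustrative estimator)."""
    raw = {}
    for codon in map("".join, product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue
        a, b, c = codon
        raw[codon] = nuc_freqs[a] * nuc_freqs[b] * nuc_freqs[c]
    total = sum(raw.values())
    return {codon: value / total for codon, value in raw.items()}


def implied_nucleotide_frequencies(codon_freqs):
    """Nucleotide composition implied by a codon frequency distribution."""
    implied = dict.fromkeys(BASES, 0.0)
    for codon, freq in codon_freqs.items():
        for base in codon:
            implied[base] += freq / 3
    return implied


if __name__ == "__main__":
    uniform = {base: 0.25 for base in BASES}  # hypothetical input frequencies
    implied = implied_nucleotide_frequencies(product_codon_frequencies(uniform))
    for base in BASES:
        print(base, round(implied[base], 4))
```

    With uniform input frequencies, the implied nucleotide composition is no longer uniform: C, which appears in no stop codon, rises to 48/183 (about 0.262), while T, A and G fall below 0.25. The stationary distribution therefore does not reproduce the nucleotide frequencies it was built from.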

    A Model-Based Analysis of GC-Biased Gene Conversion in the Human and Chimpanzee Genomes

    GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts about 1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. They are also significantly enriched for disease-associated polymorphisms, suggesting that they contribute to the fixation of deleterious alleles. The gBGC tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages. They supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available. © 2013 Capra et al

    A model-independent approach to infer hierarchical codon substitution dynamics

    Background: Codon substitution constitutes a fundamental process in molecular biology that has been studied extensively. However, prior studies rely on various assumptions, e.g. regarding the relevance of specific biochemical properties, or on conservation criteria for defining substitution groups. Ideally, one would instead like to analyze the substitution process in terms of raw dynamics, independently of underlying system specifics. In this paper we propose a method for doing this by identifying groups of codons and amino acids such that these groups imply closed dynamics. The approach relies on recently developed spectral and agglomerative techniques for identifying hierarchical organization in dynamical systems. Results: We have applied the techniques on an empirically derived Markov model of the codon substitution process that is provided in the literature. Without system specific knowledge of the substitution process, the techniques manage to "blindly" identify multiple levels of dynamics; from amino acid substitutions (via the standard genetic code) to higher order dynamics on the level of amino acid groups. We hypothesize that the acquired groups reflect earlier versions of the genetic code. Conclusions: The results demonstrate the applicability of the techniques. Due to their generality, we believe that they can be used to coarse grain and identify hierarchical organization in a broad range of other biological systems and processes, such as protein interaction networks, genetic regulatory networks and food webs.

    Selective Pressure to Increase Charge in Immunodominant Epitopes of the H3 Hemagglutinin Influenza Protein

    The evolutionary speed and the consequent immune escape of H3N2 influenza A virus make it an interesting evolutionary system. Charged amino acid residues are often significant contributors to the free energy of binding for protein–protein interactions, including antibody–antigen binding and ligand–receptor binding. We used Markov chain theory and maximum likelihood estimation to model the evolution of the number of charged amino acids on the dominant epitope in the hemagglutinin protein of circulating H3N2 virus strains. The number of charged amino acids increased in the dominant epitope B of the H3N2 virus since its introduction in humans in 1968. When epitope A became dominant in 1989, the number of charged amino acids increased in epitope A and decreased in epitope B. Interestingly, the number of charged residues in the dominant epitope of the dominant circulating strain is never fewer than that in the vaccine strain. We propose that these results indicate selective pressure for charged amino acids that increase the affinity of the virus epitope for water and decrease the affinity for host antibodies. The standard PAM model of generic protein evolution is unable to capture these trends. The reduced alphabet Markov model (RAMM) we introduce captures the increased selective pressure for charged amino acids in the dominant epitope of hemagglutinin of H3N2 influenza (R² > 0.98 between 1968 and 1988). The RAMM calibrated to historical H3N2 influenza virus evolution in humans fit well to the H3N2/Wyoming virus evolution data from guinea pig animal model studies

    PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals

    Recent statistical analyses suggest that sequencing of pooled samples provides a cost effective approach to determine genome-wide population genetic parameters. Here we introduce PoPoolation, a toolbox specifically designed for the population genetic analysis of sequence data from pooled individuals. PoPoolation calculates estimates of θWatterson, θπ, and Tajima's D that account for the bias introduced by pooling and sequencing errors, as well as divergence between species. Results of genome-wide analyses can be graphically displayed in a sliding window plot. PoPoolation is written in Perl and R and it builds on commonly used data formats. Its source code can be downloaded from http://code.google.com/p/popoolation/. Furthermore, we evaluate the influence of mapping algorithms, sequencing errors, and read coverage on the accuracy of population genetic parameter estimates from pooled data
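
    For orientation, the three statistics named above can be computed for an ordinary (non-pooled) sample of aligned haplotypes using the classic textbook definitions. The sketch below is in Python and uses a hypothetical function name, `diversity_stats`. It is not PoPoolation code and applies none of the corrections for pooling or sequencing error that the abstract describes. Values are per locus, not per site.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(haplotypes):
    """Return (theta_W, pi, Tajima's D) for equal-length aligned haplotypes.

    Classic estimators for individually sequenced samples; no pooling
    or sequencing-error correction is applied.
    """
    n = len(haplotypes)
    length = len(haplotypes[0])

    # S: number of segregating (polymorphic) sites.
    S = sum(1 for i in range(length) if len({h[i] for h in haplotypes}) > 1)

    # pi: mean number of pairwise differences.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Watterson's estimator: S divided by the (n-1)-th harmonic number.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta_w = S / a1

    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * S + e2 * S * (S - 1)

    tajima_d = (pi - theta_w) / sqrt(variance) if variance > 0 else float("nan")
    return theta_w, pi, tajima_d


if __name__ == "__main__":
    sample = ["AAAA", "AAAT", "AATT", "ATTT"]  # hypothetical toy alignment
    print(diversity_stats(sample))
```

    In pooled data the individual haplotypes are not observed, so estimators of this kind cannot be applied directly to read counts, which is the gap the toolbox addresses.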

    Changes in Gene Expression Associated with Reproductive Maturation in Wild Female Baboons

    Changes in gene expression during development play an important role in shaping morphological and behavioral differences, including between humans and nonhuman primates. Although many of the most striking developmental changes occur during early development, reproductive maturation represents another critical window in primate life history. However, this process is difficult to study at the molecular level in natural primate populations. Here, we took advantage of ovarian samples made available through an unusual episode of human–wildlife conflict to identify genes that are important in this process. Specifically, we used RNA sequencing (RNA-Seq) to compare genome-wide gene expression patterns in the ovarian tissue of juvenile and adult female baboons from Amboseli National Park, Kenya. We combined this information with prior evidence of selection occurring on two primate lineages (human and chimpanzee). We found that in cases in which genes were both differentially expressed over the course of ovarian maturation and also linked to lineage-specific selection this selective signature was much more likely to occur in regulatory regions than in coding regions. These results suggest that adaptive change in the development of the primate ovary may be largely driven at the mechanistic level by selection on gene regulation, potentially in relationship to the physiology or timing of female reproductive maturation

    Predicting disease-associated substitution of a single amino acid by analyzing residue interactions

    Background: The rapid accumulation of data on non-synonymous single nucleotide polymorphisms (nsSNPs, also called SAPs) should allow us to further our understanding of the underlying disease-associated mechanisms. Here, we use complex networks to study the role of an amino acid in both local and global structures and determine the extent to which disease-associated and polymorphic SAPs differ in terms of their interactions to other residues. Results: We found that SAPs can be well characterized by network topological features. Mutations are probably disease-associated when they occur at a site with a high centrality value and/or high degree value in a protein structure network. We also discovered that study of the neighboring residues around a mutation site can help to determine whether the mutation is disease-related or not. We compiled a dataset from the Swiss-Prot variant pages and constructed a model to predict disease-associated SAPs based on the random forest algorithm. The values of total accuracy and MCC were 83.0% and 0.64, respectively, as determined by 5-fold cross-validation. With an independent dataset, our model achieved a total accuracy of 80.8% and MCC of 0.59, respectively. Conclusions: The satisfactory performance suggests that network topological features can be used as quantification measures to determine the importance of a site on a protein, and this approach can complement existing methods for prediction of disease-associated SAPs. Moreover, the use of this method in SAP studies would help to determine the underlying linkage between SAPs and diseases through extensive investigation of mutual interactions between residues.
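
    The two performance measures reported above, total accuracy and the Matthews correlation coefficient (MCC), are both functions of a binary confusion matrix. The sketch below shows the standard definitions in Python. The function names and the counts in the example are hypothetical and are not taken from the study.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical counts: disease-associated is the positive class.
    tp, tn, fp, fn = 50, 40, 10, 0
    print("accuracy:", accuracy(tp, tn, fp, fn))
    print("MCC:", matthews_corrcoef(tp, tn, fp, fn))
```

    MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the two classes are of very different sizes, which is presumably why both values are reported.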